Fix race condition: subscriber misses publisher-connected notification (No such feed) #1254
vladopol wants to merge 1 commit
When subscribers reconnect simultaneously with publishers (e.g. after a proxy timeout), the publisher's webrtcup event can fire before the subscriber's waiter is created. The notification is lost, the subscriber waits until context timeout, and the call participant cannot hear others.
Two issues fixed:
1. async/notifier.go: add sticky notification with 30-second TTL. Notify() now records the notification time. NewWaiter() checks if the key was notified recently and returns an already-closed channel so the caller retries immediately instead of waiting indefinitely.
2. sfu/janus/subscriber.go: move waiter creation inside the retry loop. The original code created one waiter before the loop and reused it, causing a tight busy-loop when the waiter channel was pre-closed by the sticky notifier. Creating a fresh waiter per retry eliminates this.
Observed in production: clusters of 10-44 "No such feed" errors per reconnect event affecting 2-5 users, occurring ~10x/day. After fix: 0.
Production validation
Running this patch in production for 5 days on a deployment with ~30 concurrent users (Nextcloud Talk 23.0.5, HPB 2.1.1, Janus 1.4.0, Talk Desktop v2.1.x on Windows):
The reconnect storms that triggered the race were caused by simultaneous WebSocket drops (clients that connected at similar times hit a proxy timeout together). After the fix, multiple identical reconnect events produced zero "No such feed" errors.
The 30-second TTL in
A dedicated issue with full reproduction details has been opened: #1257
Thanks, I will be traveling for the next two weeks and will take a look when I'm back.
Added an alternative fix in #1263 that doesn't need to remember the signaled events using a TTL map but keeps the notification object alive while the publisher still exists. This makes it clearer to reason about how long events are kept alive, especially in case of clients stopping and starting to publish.
Included a test that makes sure the publisher receives and signals the connected event before the subscriber waits for it (which would trigger the race without the change).
Would be great if you could test this in your environment, too. Thanks!
PR: Fix "No such feed" race condition on subscriber reconnect
Problem
When multiple clients reconnect simultaneously (e.g. after a proxy timeout or network
interruption), subscribers frequently fail to receive audio/video from publishers with a
No such feed (N) error from Janus VideoRoom, leaving them unable to hear other
participants despite the call appearing active.
Root cause: lost wakeup in async.Notifier
The retry logic in sfu/janus/subscriber.go relies on newPublisherConnectedWaiter, which uses
async.Notifier to wait until the publisher signals it is connected (via webrtcup event → notifyPublisherConnected).
The race condition: async.Notifier.Notify() only wakes existing waiters; if none exist at the time of
notification, the event is permanently lost. Subscribers created after the publisher
connected never receive the signal and wait until context timeout.
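The lost wakeup can be illustrated in isolation. The following is a minimal sketch, not the project's actual async.Notifier: naiveNotifier and lostWakeupTimesOut are hypothetical names, and the only assumption carried over from the description is that Notify wakes the waiters that exist at that moment and remembers nothing.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// naiveNotifier is a hypothetical stand-in (not the real async.Notifier)
// that only wakes waiters already registered when Notify is called.
type naiveNotifier struct {
	mu      sync.Mutex
	waiters map[string][]chan struct{}
}

func newNaiveNotifier() *naiveNotifier {
	return &naiveNotifier{waiters: make(map[string][]chan struct{})}
}

// NewWaiter registers a channel that the next Notify for key will close.
func (n *naiveNotifier) NewWaiter(key string) <-chan struct{} {
	n.mu.Lock()
	defer n.mu.Unlock()
	ch := make(chan struct{})
	n.waiters[key] = append(n.waiters[key], ch)
	return ch
}

// Notify closes the channels of all current waiters and forgets the event.
func (n *naiveNotifier) Notify(key string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	for _, ch := range n.waiters[key] {
		close(ch)
	}
	delete(n.waiters, key)
}

// lostWakeupTimesOut notifies before the waiter exists and reports whether
// the late waiter ran into the context deadline.
func lostWakeupTimesOut() bool {
	n := newNaiveNotifier()
	n.Notify("publisher-1")          // publisher connected first
	ch := n.NewWaiter("publisher-1") // subscriber registers too late

	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	select {
	case <-ch:
		return false
	case <-ctx.Done():
		return errors.Is(ctx.Err(), context.DeadlineExceeded)
	}
}

func main() {
	fmt.Println("subscriber timed out:", lostWakeupTimesOut())
}
```

In this sketch the waiter is created after the only notification, so nothing ever closes its channel and the wait ends with the context deadline, which mirrors the subscriber waiting until context timeout.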
Secondary issue: busy-loop after sticky fix
The original code creates the waiter once before the retry loop and reuses it on every iteration.
If NewWaiter returns an already-closed channel (sticky behavior), subsequent Wait() calls on the same waiter return immediately → tight busy-loop spamming Janus with join requests until context expiry.
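The reason a reused waiter never blocks again follows from Go channel semantics: a receive on a closed channel always succeeds immediately. A small sketch, assuming the waiter is backed by a plain channel as the description suggests (countImmediateWakeups is a hypothetical helper):

```go
package main

import "fmt"

// countImmediateWakeups polls an already-closed channel n times without
// blocking and counts how often the receive was ready.
func countImmediateWakeups(n int) int {
	ch := make(chan struct{})
	close(ch) // models a waiter that was handed out pre-closed

	woke := 0
	for i := 0; i < n; i++ {
		select {
		case <-ch: // a closed channel is always ready to receive
			woke++
		default: // would be taken if the receive had to block
		}
	}
	return woke
}

func main() {
	fmt.Println(countImmediateWakeups(5)) // prints 5
}
```

Every iteration takes the receive branch, so a retry loop that waits on the same pre-closed channel gets no pause between join attempts.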
Fix
1. async/notifier.go: sticky notifications (30-second TTL)
Notifier now remembers the timestamp of the last Notify() call per key. NewWaiter() checks: if the key was notified within the last 30 seconds, it returns an already-closed channel so the caller retries immediately instead of waiting indefinitely.
Reset() clears notifiedAt together with the waiter maps to prevent stale signals after a publisher leaves and rejoins.
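A rough sketch of the sticky behaviour described above, written from the description rather than from the actual patch: stickyNotifier, stickyTTL, the injectable now clock and isClosed are all illustrative assumptions, and only notifiedAt and the 30-second TTL come from the text.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const stickyTTL = 30 * time.Second // TTL taken from the PR description

// stickyNotifier is a simplified, hypothetical model of the patched notifier.
type stickyNotifier struct {
	mu         sync.Mutex
	waiters    map[string][]chan struct{}
	notifiedAt map[string]time.Time
	now        func() time.Time // injectable clock, an assumption for testing
}

func newStickyNotifier(now func() time.Time) *stickyNotifier {
	return &stickyNotifier{
		waiters:    make(map[string][]chan struct{}),
		notifiedAt: make(map[string]time.Time),
		now:        now,
	}
}

// NewWaiter returns an already-closed channel if key was notified within the TTL.
func (n *stickyNotifier) NewWaiter(key string) <-chan struct{} {
	n.mu.Lock()
	defer n.mu.Unlock()
	ch := make(chan struct{})
	if at, ok := n.notifiedAt[key]; ok && n.now().Sub(at) < stickyTTL {
		close(ch)
		return ch
	}
	n.waiters[key] = append(n.waiters[key], ch)
	return ch
}

// Notify records the notification time and wakes all current waiters.
func (n *stickyNotifier) Notify(key string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.notifiedAt[key] = n.now()
	for _, ch := range n.waiters[key] {
		close(ch)
	}
	delete(n.waiters, key)
}

// Reset drops waiters and remembered timestamps so stale signals cannot fire.
func (n *stickyNotifier) Reset() {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.waiters = make(map[string][]chan struct{})
	n.notifiedAt = make(map[string]time.Time)
}

// isClosed reports whether ch is already closed, without blocking.
func isClosed(ch <-chan struct{}) bool {
	select {
	case <-ch:
		return true
	default:
		return false
	}
}

// demo checks a waiter created within the TTL, after the TTL, and after Reset.
func demo() (fresh, expired, afterReset bool) {
	clock := time.Unix(0, 0)
	n := newStickyNotifier(func() time.Time { return clock })

	n.Notify("publisher-1")
	clock = clock.Add(10 * time.Second)
	fresh = isClosed(n.NewWaiter("publisher-1"))

	clock = clock.Add(60 * time.Second)
	expired = isClosed(n.NewWaiter("publisher-1"))

	n.Notify("publisher-1")
	n.Reset()
	afterReset = isClosed(n.NewWaiter("publisher-1"))
	return
}

func main() {
	fmt.Println(demo()) // prints: true false false
}
```

The fake clock keeps the demo deterministic: a waiter created 10 seconds after the notification is pre-closed, one created after the TTL has passed blocks again, and Reset discards a fresh notification.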
2. sfu/janus/subscriber.go: waiter created inside retry loop
Move newPublisherConnectedWaiter inside the JANUS_VIDEOROOM_ERROR_NO_SUCH_FEED case, creating a fresh waiter on every retry iteration:
```go
retry:
	// join attempt
	case janus.JANUS_VIDEOROOM_ERROR_NO_SUCH_FEED:
		waiter, stop := p.mcu.newPublisherConnectedWaiter(p.publisher, p.streamType)
		if err := waiter.Wait(ctx); err != nil {
			stop()
			...
		}
		stop()
		goto retry
```
This eliminates the busy-loop: each iteration creates a new waiter. If the publisher
was already connected (sticky signal within TTL), the new waiter returns immediately
and the retry join succeeds. If the publisher is not yet connected, the waiter blocks
until the signal arrives.
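The control flow of the retry can be sketched as a self-contained loop. This is an illustration under stated assumptions, not the code in sfu/janus/subscriber.go: errNoSuchFeed, joinWithRetry and the join and newWaiter callbacks are hypothetical stand-ins for the Janus error code, the subscriber's join request and newPublisherConnectedWaiter.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNoSuchFeed is a hypothetical stand-in for the Janus "No such feed" error.
var errNoSuchFeed = errors.New("no such feed")

// joinWithRetry calls join; on errNoSuchFeed it creates a fresh waiter,
// waits for the publisher-connected signal, and tries again.
// It returns the number of join attempts and the final error.
func joinWithRetry(
	ctx context.Context,
	join func() error,
	newWaiter func() (<-chan struct{}, func()),
) (int, error) {
	attempts := 0
	for {
		attempts++
		err := join()
		if !errors.Is(err, errNoSuchFeed) {
			return attempts, err // success or an unrelated error
		}
		ch, stop := newWaiter() // fresh waiter per retry
		select {
		case <-ch:
			stop()
		case <-ctx.Done():
			stop()
			return attempts, ctx.Err()
		}
	}
}

// demo simulates a publisher that connects right after the first failed join.
func demo() (int, error) {
	connected := make(chan struct{})
	calls := 0
	join := func() error {
		calls++
		if calls == 1 {
			go func() { close(connected) }() // publisher connects shortly after
			return errNoSuchFeed
		}
		return nil
	}
	newWaiter := func() (<-chan struct{}, func()) {
		return connected, func() {} // no cleanup needed in this sketch
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return joinWithRetry(ctx, join, newWaiter)
}

func main() {
	attempts, err := demo()
	fmt.Println(attempts, err) // prints: 2 <nil>
}
```

The first join fails, the loop blocks on the new waiter until the simulated publisher connects, and the second join succeeds, so the demo always finishes in two attempts.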
Observed behaviour after fix
Production system (Nextcloud Talk 23.0.4, spreed-signaling 2.1.1, Janus 1.4.0):
Before fix: No such feed errors in clusters of 10–44 per reconnect event, affecting 2–5 users per incident, occurring ~10 times per day during business hours
After fix: No such feed = 0 across multiple reconnect events
Testing
The race can be reproduced by:
(e.g. restart the signaling server or expire proxy timeout)
Without fix: some subscribers get No such feed and never receive audio.
With fix: all subscribers successfully subscribe after retry.